924 results for Language identification, universal background model, vocal tract length normalization, phone recognition, output score fusion, multi-lingual phone recognition, cross-lingual speech recognition


Relevance:

100.00% 100.00%

Publisher:

Abstract:

Automatic spoken Language Identification (LID) is the process of identifying the language spoken within an utterance. The challenge that this task presents is that no prior information is available indicating the content of the utterance or the identity of the speaker. The trend of globalization and the pervasive popularity of the Internet will amplify the need for the capabilities spoken language identification systems provide. A prominent application arises in call centers dealing with speakers speaking different languages. Another important application is to index or search huge speech data archives and corpora that contain multiple languages. The aim of this research is to develop techniques targeted at producing a fast and more accurate automatic spoken LID system compared to the previous National Institute of Standards and Technology (NIST) Language Recognition Evaluation. Acoustic and phonetic speech information are targeted as the most suitable features for representing the characteristics of a language. To model the acoustic speech features a Gaussian Mixture Model based approach is employed. Phonetic speech information is extracted using existing speech recognition technology. Various techniques to improve LID accuracy are also studied. One approach examined is the employment of Vocal Tract Length Normalization to reduce the speech variation caused by different speakers. A linear data fusion technique is adopted to combine the various aspects of information extracted from speech. As a result of this research, a LID system was implemented and presented for evaluation in the 2003 Language Recognition Evaluation conducted by the NIST.
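The linear fusion step mentioned in this abstract can be illustrated with a minimal sketch. It assumes that each subsystem (acoustic and phonetic) emits one score per candidate language and that fusion is a plain weighted sum. The function names, the weight `w_acoustic` and the scores are hypothetical and are not taken from the thesis.

```python
from typing import Dict


def fuse_scores(acoustic: Dict[str, float],
                phonetic: Dict[str, float],
                w_acoustic: float = 0.5) -> Dict[str, float]:
    """Linearly combine per-language scores from two subsystems.

    Assumption: both subsystems score the same set of languages and
    their scores are already on comparable scales.
    """
    if acoustic.keys() != phonetic.keys():
        raise ValueError("both subsystems must score the same languages")
    w_phonetic = 1.0 - w_acoustic
    return {
        lang: w_acoustic * acoustic[lang] + w_phonetic * phonetic[lang]
        for lang in acoustic
    }


def identify(fused: Dict[str, float]) -> str:
    """Pick the language with the highest fused score."""
    return max(fused, key=fused.get)


# Hypothetical log-likelihood-style scores for one utterance.
acoustic_scores = {"english": -1.2, "spanish": -0.9, "mandarin": -2.0}
phonetic_scores = {"english": -0.4, "spanish": -1.5, "mandarin": -1.8}

fused = fuse_scores(acoustic_scores, phonetic_scores, w_acoustic=0.4)
best = identify(fused)
```

In practice the fusion weights would be tuned on held-out data; the fixed value here is only for illustration.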

Relevance:

100.00% 100.00%

Publisher:

Abstract:

This paper presents speaker normalization approaches for the audio search task. The conventional state-of-the-art feature set, Mel Frequency Cepstral Coefficients (MFCC), is known to implicitly contain both speaker-specific and linguistic information, which can be a problem for speaker-independent audio search. In this paper, a universal warping-based approach is used for vocal tract length normalization in audio search. In particular, features such as the scale transform and warped linear prediction are used to compensate for speaker variability in audio matching. The advantage of these features over the conventional feature set is that they apply universal frequency warping to both of the templates to be matched during audio search. The performance of Scale Transform Cepstral Coefficients (STCC) and Warped Linear Prediction Cepstral Coefficients (WLPCC) is about 3% higher than that of the state-of-the-art MFCC feature set on the TIMIT database.

Relevance:

100.00% 100.00%

Publisher:

Abstract:

In current methods for voice transformation and speech synthesis, the vocal tract filter is usually assumed to be excited by a flat amplitude spectrum. In this article, we present a method using a mixed source model defined as a mixture of the Liljencrants-Fant (LF) model and Gaussian noise. Because it uses the LF model, the approach underlying this work is close to a vocoder with exogenous input, like ARX-based methods or the Glottal Spectral Separation (GSS) method. Such approaches are dedicated to voice processing and promise improved naturalness compared to generic signal models. To estimate the Vocal Tract Filter (VTF) by spectral division, as in GSS, we show that a glottal source model can be used with any envelope estimation method, in contrast to the ARX approach, where a least-squares AR solution is used. We therefore derive a VTF estimate which takes into account the amplitude spectra of both the deterministic and random components of the glottal source. The proposed mixed source model is controlled by a small set of intuitive and independent parameters. The relevance of this voice production model is evaluated, through listening tests, in the context of resynthesis, HMM-based speech synthesis, breathiness modification and pitch transposition. © 2012 Elsevier B.V. All rights reserved.

Relevance:

100.00% 100.00%

Publisher:

Abstract:

This paper discusses the Cambridge University HTK (CU-HTK) system for the automatic transcription of conversational telephone speech. A detailed discussion of the most important techniques in front-end processing, acoustic modeling and model training, and language and pronunciation modeling is presented. These include the use of conversation side based cepstral normalization, vocal tract length normalization, heteroscedastic linear discriminant analysis for feature projection, minimum phone error training and speaker adaptive training, lattice-based model adaptation, confusion network based decoding and confidence score estimation, pronunciation selection, language model interpolation, and class based language models. The transcription system developed for participation in the 2002 NIST Rich Transcription evaluations of English conversational telephone speech data is presented in detail. In this evaluation the CU-HTK system gave an overall word error rate of 23.9%, which was the best performance by a statistically significant margin. Further details on the derivation of faster systems with moderate performance degradation are discussed in the context of the 2002 CU-HTK 10 × RT conversational speech transcription system. © 2005 IEEE.

Relevance:

100.00% 100.00%

Publisher:

Abstract:

The second-order differential equations that describe a polyphase transmission line are difficult to solve because of the mutual coupling among the phases and because the parameters are distributed along the line's length. One method for analyzing polyphase systems is to decouple their phases. A system with n coupled phases can then be represented by n decoupled single-phase systems that are mathematically identical to the original system. Once the n-phase circuit is obtained, the voltages and currents at any point on the line can be calculated using computational methods. The Universal Line Model (ULM) transforms the differential equations in the time domain into algebraic equations in the frequency domain, solves them, and recovers the time-domain solution using the inverse Laplace transform. This work analyzes the method of modal decomposition in a three-phase transmission line for the evaluation of the voltages and currents of the line during the energizing process.
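The decoupling idea can be illustrated with a minimal sketch. It assumes an ideally transposed three-phase line, so that the series impedance matrix has a single self term and a single mutual term, and it uses Clarke's matrix as the modal transformation. Neither assumption is stated in the abstract, and the impedance values are hypothetical.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def transpose(m):
    """Transpose a matrix given as nested lists."""
    return [list(row) for row in zip(*m)]


S2, S3, S6 = math.sqrt(2), math.sqrt(3), math.sqrt(6)

# Orthonormal Clarke matrix: columns are the alpha, beta and zero modes.
CLARKE = [
    [2 / S6, 0.0, 1 / S3],
    [-1 / S6, 1 / S2, 1 / S3],
    [-1 / S6, -1 / S2, 1 / S3],
]


def modal_impedances(z_self, z_mutual):
    """Return T^T * Z * T for a transposed line (assumed symmetric Z)."""
    z_phase = [
        [z_self if i == j else z_mutual for j in range(3)] for i in range(3)
    ]
    return matmul(matmul(transpose(CLARKE), z_phase), CLARKE)


# Hypothetical per-unit-length impedances in ohm/km.
z_modal = modal_impedances(1 + 2j, 0.2 + 0.5j)
# The result is diagonal up to rounding: the three coupled phases
# become three independent single-phase modes.
```

For an untransposed line the transformation matrix generally depends on frequency and must be computed from the eigenvectors of the line parameter matrices, so this constant matrix is only an approximation in that case.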

Relevance:

100.00% 100.00%

Publisher:

Abstract:

In a continuous one-dimensional space, a coupled directed continuous time random walk model is proposed, in which the random walker jumps in one direction and the waiting time between jumps affects the subsequent jump. In the proposed model, the Laplace-Laplace transform of the probability density function P(x,t) of finding the walker at position x at time t is completely determined by the Laplace transform of the probability density function φ(t) of the waiting time. In terms of the probability density function of the waiting time in the Laplace domain, the limit distribution of the random process and the corresponding evolving equations are derived.
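A small simulation can make the coupling between waiting time and jump concrete. This is only a sketch under loud assumptions: waiting times are drawn from an exponential distribution, and each jump length is a power of the preceding waiting time. The abstract specifies neither choice, and the parameters `rate` and `gamma` are hypothetical.

```python
import random


def simulate_walk(n_jumps, rate=1.0, gamma=1.0, seed=0):
    """Simulate a directed walk whose jump length depends on the wait.

    Assumptions (not from the paper): exponential waiting times with
    the given rate, and jump length equal to wait ** gamma.
    Returns a list of (time, position) pairs, starting at (0.0, 0.0).
    """
    rng = random.Random(seed)
    t, x = 0.0, 0.0
    path = [(t, x)]
    for _ in range(n_jumps):
        wait = rng.expovariate(rate)  # waiting time before the jump
        t += wait
        x += wait ** gamma            # jump in the positive direction only
        path.append((t, x))
    return path


path = simulate_walk(1000, rate=1.0, gamma=0.5, seed=42)
```

Because every jump is non-negative, the position never decreases, which is the "directed" property of the model. Histograms of the final position over many seeds would give an empirical view of the limit distribution.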

Relevance:

100.00% 100.00%

Publisher:

Abstract:

Human listeners can identify vowels regardless of speaker size, although the sound waves for an adult and a child speaking the ‘same’ vowel would differ enormously. The differences are mainly due to the differences in vocal tract length (VTL) and glottal pulse rate (GPR) which are both related to body size. Automatic speech recognition machines are notoriously bad at understanding children if they have been trained on the speech of an adult. In this paper, we propose that the auditory system adapts its analysis of speech sounds, dynamically and automatically to the GPR and VTL of the speaker on a syllable-to-syllable basis. We illustrate how this rapid adaptation might be performed with the aid of a computational version of the auditory image model, and we propose that an auditory preprocessor of this form would improve the robustness of speech recognisers.

Relevance:

100.00% 100.00%

Publisher:

Abstract:

Several pixel-based people counting methods have been developed over the years. Among these, the product of scale-weighted pixel sums and a linear correlation coefficient is a popular people counting approach. However, most approaches have paid little attention to resolving the true background and instead take all foreground pixels into account. With large crowds moving at varying speeds and with the presence of other moving objects such as vehicles, this approach is prone to problems. In this paper we present a method which concentrates on determining the true foreground, i.e. human-image pixels only. To do this we have proposed, implemented and comparatively evaluated a human detection layer to make people counting more robust in the presence of noise and lack of empty background sequences. We show the effect of combining human detection with a pixel-map based algorithm to i) count only human-classified pixels and ii) prevent foreground pixels belonging to humans from being absorbed into the background model. We evaluate the performance of this approach on the PETS 2009 dataset using various configurations of the proposed methods. Our evaluation demonstrates that the basic benchmark method we implemented can achieve an accuracy of up to 87% on sequence "S1.L1 13-57 View 001" and our proposed approach can achieve up to 82% on sequence "S1.L3 14-33 View 001", where the crowd stops and the benchmark accuracy falls to 64%.

Relevance:

100.00% 100.00%

Publisher:

Abstract:

This thesis presents a universal model of documents and deltas. This model formalizes what it means to find differences between documents and provides a single shared formalization that can be used by any algorithm to describe the differences found between any kind of comparable documents. The main scientific contribution of this thesis is a universal delta model that can be used to represent the changes found by an algorithm. The main parts of this model are the formal definitions of changes (the pieces of information that record that something has changed), operations (the definitions of the kind of change that happened) and deltas (coherent summaries of what has changed between two documents). The fundamental mechanism that makes the universal delta model a very expressive tool is the use of encapsulation relations between changes. In the universal delta model, changes are not always simple records of what has changed; they can also be combined into more complex changes that reflect the detection of more meaningful modifications. In addition to the main entities (i.e., changes, operations and deltas), the model also describes and defines documents and the concept of equivalence between documents. As a corollary to the model, there is also an extensible catalog of possible operations that algorithms can detect, used to create a common library of operations, and a UML serialization of the model, useful as a reference when implementing APIs that deal with deltas. The universal delta model presented in this thesis acts as the formal groundwork upon which algorithms can be based and libraries can be implemented. It removes the need to recreate a new delta model and terminology whenever a new algorithm is devised. It also alleviates the problems that toolmakers have when adapting their software to new diff algorithms.
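The relation between changes, operations and deltas, including encapsulation, can be sketched as a small data structure. This is an illustrative reading of the abstract, not the thesis's formal model or its UML serialization. The class names, field names and operation labels are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Change:
    """A record that something changed.

    `operation` names the kind of change; `parts` holds the simpler
    changes that this change encapsulates (empty for an atomic change).
    """
    operation: str
    detail: str
    parts: List["Change"] = field(default_factory=list)

    def atomic_count(self) -> int:
        """Count the atomic changes contained in this change."""
        if not self.parts:
            return 1
        return sum(part.atomic_count() for part in self.parts)


@dataclass
class Delta:
    """A summary of what changed between two documents."""
    source: str
    target: str
    changes: List[Change] = field(default_factory=list)

    def atomic_count(self) -> int:
        return sum(change.atomic_count() for change in self.changes)


# A "move" is modelled here as a complex change that encapsulates
# a deletion and an insertion (hypothetical operation names).
deleted = Change("delete-text", "paragraph 2 removed from section 1")
inserted = Change("insert-text", "paragraph 2 added to section 3")
moved = Change("move", "paragraph 2 moved", parts=[deleted, inserted])
retitled = Change("update-text", "title reworded")

delta = Delta("draft-v1", "draft-v2", changes=[moved, retitled])
total_atomic = delta.atomic_count()
```

Keeping the encapsulated changes inside the complex change lets a tool present either the high-level "move" or the underlying deletion and insertion, depending on how much detail the reader needs.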